ATOM Documentation

# GraphRAG Idempotency Issue Analysis & Fix

## Problem Report

### Symptom

"No new entities created with Outlook ingestion"

### Root Cause

**Entities were being created. The real problem was that every ingestion created a fresh duplicate of each one.**

The bug was in `graphrag_engine.py`, lines 1156-1165 (before the fix):

```python
# Deduplicate logic simplified:  ← COMMENT WAS LYING!
node = GraphNode(
    workspace_id=workspace_id,
    name=name,
    type=e_data.get("type", "unknown"),
    ...
)
session.add(node)  # ← Just blindly adds, NO CHECK!
```

**Impact:**

- 1st ingestion of email → Creates GraphNode("Test Subject", type="email")

- 2nd ingestion of same email → Creates ANOTHER GraphNode("Test Subject", type="email")

- Nth ingestion → N duplicates! ❌

### Why This Matters

1. **Database Bloat** - Graph explodes with duplicate nodes

2. **Performance Degradation** - Queries get slower as node count grows

3. **Incorrect Analytics** - Entity counts are meaningless

4. **Relationship Chaos** - Duplicate nodes create messy relationship webs

## Solution Implemented

### Fix Applied (graphrag_engine.py lines 1155-1185)

```python
# Check if node already exists (workspace_id, name, type)
existing = (
    session.query(GraphNode)
    .filter_by(
        workspace_id=workspace_id,
        name=name,
        type=e_data.get("type", "unknown")
    )
    .first()
)

if existing:
    # Update existing node
    existing.description = e_data.get("description", existing.description)
    existing.properties.update(properties)
    node_id = existing.id
    logger.debug(f"Updated existing node: {name} ({existing.type})")
else:
    # Create new node
    node = GraphNode(
        workspace_id=workspace_id,
        name=name,
        type=e_data.get("type", "unknown"),
        description=e_data.get("description", ""),
        properties=properties,
    )
    session.add(node)
    session.flush()
    node_id = node.id
    logger.debug(f"Created new node: {name} ({e_data.get('type', 'unknown')})")

node_map[name] = node_id
```

### What Changed

**Before:**

- Always created new nodes (duplicates)

- No check for existing entities

- Properties never updated

**After:**

- Checks if node exists (workspace_id, name, type)

- If exists: UPDATE description and properties

- If not: CREATE new node

- Proper logging (debug level)

## Critique of Original Idempotency Plan

### ✅ Good Ideas (Should Still Implement)

1. **Content Hashing** - Track if entity actually changed before updating

2. **source_ids JSONB** - Track which documents contributed to an entity

3. **Unique Constraints** - Add database-level uniqueness:

```sql
CREATE UNIQUE INDEX ix_graph_nodes_unique
ON graph_nodes (workspace_id, name, type)
WHERE workspace_id IS NOT NULL;
```

4. **ON CONFLICT Upserts** - For better performance:

```python
from sqlalchemy import text

# Current (check-then-insert in Python)
existing = session.query(GraphNode).filter_by(...).first()
if existing:
    existing.description = ...
else:
    session.add(GraphNode(...))

# Better (ON CONFLICT in Postgres)
# Single round-trip, atomic, faster
# The WHERE clause repeats the partial index predicate so Postgres can
# match ix_graph_nodes_unique as the conflict target.
insert_stmt = text("""
    INSERT INTO graph_nodes (workspace_id, name, type, description, properties)
    VALUES (:workspace_id, :name, :type, :description, :properties)
    ON CONFLICT (workspace_id, name, type) WHERE workspace_id IS NOT NULL
    DO UPDATE SET
        description = EXCLUDED.description,
        properties = graph_nodes.properties || EXCLUDED.properties
""")
```
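
As a rough, runnable illustration of how ideas 1, 3, and 4 could fit together, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres (SQLite accepts a similar `ON CONFLICT ... DO UPDATE` clause). The `content_hash` and `revision` columns, the `upsert_node` helper, and the non-partial unique index are assumptions made for this example and are not part of the current schema.

```python
import hashlib
import sqlite3


def content_hash(description: str) -> str:
    """Fingerprint of the fields that count as a real change (assumed: description only)."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def upsert_node(conn, workspace_id, name, node_type, description):
    """Insert a node, or update it only when its content hash differs."""
    conn.execute(
        """
        INSERT INTO graph_nodes (workspace_id, name, type, description, content_hash)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (workspace_id, name, type)
        DO UPDATE SET
            description = excluded.description,
            content_hash = excluded.content_hash,
            revision = graph_nodes.revision + 1
        WHERE graph_nodes.content_hash <> excluded.content_hash
        """,
        (workspace_id, name, node_type, description, content_hash(description)),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE graph_nodes (
        id INTEGER PRIMARY KEY,
        workspace_id TEXT NOT NULL,
        name TEXT NOT NULL,
        type TEXT NOT NULL,
        description TEXT NOT NULL DEFAULT '',
        content_hash TEXT NOT NULL,
        revision INTEGER NOT NULL DEFAULT 1  -- hypothetical change counter
    )
    """
)
conn.execute(
    "CREATE UNIQUE INDEX ix_graph_nodes_unique "
    "ON graph_nodes (workspace_id, name, type)"
)

upsert_node(conn, "ws-1", "Test Subject", "email", "first version")
upsert_node(conn, "ws-1", "Test Subject", "email", "first version")   # same hash: skipped
upsert_node(conn, "ws-1", "Test Subject", "email", "second version")  # new hash: updated

rows = conn.execute("SELECT description, revision FROM graph_nodes").fetchall()
```

The `WHERE` on the `DO UPDATE` branch is what turns the hash into a skip condition: re-ingesting unchanged content leaves the row untouched, so `revision` only moves when the description actually differs.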

### ❌ Issues with Original Plan

1. **Missing Multi-Workspace Support**

- Plan doesn't add tenant_id to GraphNode/GraphEdge

- But we just implemented multi-workspace for EntityTypeDefinition!

- Inconsistency will cause problems

2. **No Document-Level Dedup**

- Plan focuses on entity/edge dedup

- Missing: Track which documents have been processed

- Suggestion:

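```python
from sqlalchemy import Column, DateTime, Integer, String

class ProcessedDocument(Base):
    __tablename__ = "processed_documents"  # assumed table name
    id = Column(Integer, primary_key=True)
    doc_id = Column(String, unique=True)  # Prevent re-processing
    content_hash = Column(String)
    processed_at = Column(DateTime)
```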

3. **Performance Concerns**

- Check-then-merge in Python is slow (2+ queries per entity)

- Should use raw SQL with ON CONFLICT for bulk ops
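
To make the document-level dedup suggestion more concrete, here is a minimal in-memory sketch of the decision it implies: skip a document whose `doc_id` and content hash have both been seen before. The `ProcessedDocumentRegistry` class and its `should_process` method are hypothetical names for illustration; a real version would presumably be backed by a table rather than a dict.

```python
import hashlib
from datetime import datetime, timezone


class ProcessedDocumentRegistry:
    """In-memory stand-in for a processed-documents table (hypothetical)."""

    def __init__(self):
        self._seen = {}  # doc_id -> (content_hash, processed_at)

    def should_process(self, doc_id: str, content: str) -> bool:
        """Return True if the document is new or its content has changed."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        previous = self._seen.get(doc_id)
        if previous is not None and previous[0] == digest:
            return False  # same document, same content: nothing to do
        self._seen[doc_id] = (digest, datetime.now(timezone.utc))
        return True


registry = ProcessedDocumentRegistry()
first = registry.should_process("email-123", "Test Subject body")
repeat = registry.should_process("email-123", "Test Subject body")
edited = registry.should_process("email-123", "Test Subject body (edited)")
```

Note that this sketch records the document before extraction has run. If extraction can fail midway, it may be safer to record the hash only after the entities have been committed.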

## Testing the Fix

### Verify Deduplication Works

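```bash
# 1. Ingest the same email twice (run this request two times)
curl -X POST https://atom-saas.fly.dev/api/integrations/outlook/sync \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Tenant-Id: $TENANT_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "start_date": "2024-01-01T00:00:00Z",
    "end_date": "2024-01-02T00:00:00Z"
  }'

# 2. Check database for duplicates
cd backend-saas
python3 <<'EOF'
from sqlalchemy import func

from core.database import SessionLocal
from core.models import GraphNode

session = SessionLocal()
# Group by the full uniqueness key so that identical names in
# different workspaces are not reported as duplicates.
duplicates = session.query(
    GraphNode.workspace_id, GraphNode.name, GraphNode.type, func.count(GraphNode.id)
).group_by(
    GraphNode.workspace_id, GraphNode.name, GraphNode.type
).having(
    func.count(GraphNode.id) > 1
).all()

for workspace_id, name, node_type, count in duplicates:
    print(f"DUPLICATE: {name} ({node_type}) in workspace {workspace_id} - {count} instances")
session.close()
EOF
```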

### Expected Result After Fix

- No duplicate entries

- Second ingestion should UPDATE existing node

- Should see "Updated existing node" in logs (debug level)
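
The expected behaviour can also be checked without a running deployment. The following is a self-contained sketch, not the project's test suite: it mirrors the check-then-update logic using the standard-library `sqlite3` module and a simplified `graph_nodes` table. The `ingest_entity` and `find_duplicates` helpers, and the choice to store properties as JSON text, are assumptions made for the example.

```python
import json
import sqlite3


def ingest_entity(conn, workspace_id, name, node_type, description, properties):
    """Check-then-update: reuse the row for (workspace_id, name, type) if present."""
    row = conn.execute(
        "SELECT id, properties FROM graph_nodes "
        "WHERE workspace_id = ? AND name = ? AND type = ?",
        (workspace_id, name, node_type),
    ).fetchone()
    if row:
        node_id, stored = row
        merged = {**json.loads(stored), **properties}  # new keys win on conflict
        conn.execute(
            "UPDATE graph_nodes SET description = ?, properties = ? WHERE id = ?",
            (description, json.dumps(merged), node_id),
        )
        return node_id
    cur = conn.execute(
        "INSERT INTO graph_nodes (workspace_id, name, type, description, properties) "
        "VALUES (?, ?, ?, ?, ?)",
        (workspace_id, name, node_type, description, json.dumps(properties)),
    )
    return cur.lastrowid


def find_duplicates(conn):
    """Return every (workspace_id, name, type) key that has more than one row."""
    return conn.execute(
        "SELECT workspace_id, name, type, COUNT(id) FROM graph_nodes "
        "GROUP BY workspace_id, name, type HAVING COUNT(id) > 1"
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE graph_nodes (id INTEGER PRIMARY KEY, workspace_id TEXT, "
    "name TEXT, type TEXT, description TEXT, properties TEXT)"
)

first_id = ingest_entity(conn, "ws-1", "Test Subject", "email", "v1", {"sender": "a@example.com"})
second_id = ingest_entity(conn, "ws-1", "Test Subject", "email", "v2", {"folder": "Inbox"})
duplicates = find_duplicates(conn)
```

Both calls should resolve to the same node id, and `find_duplicates` should come back empty, which is the same condition the database check above looks for.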

## Next Steps

1. ✅ **Immediate Fix** - Applied (check-then-update logic)

2. **High Priority** - Add unique constraint migration

3. **Medium Priority** - Implement content hashing to skip unnecessary updates

4. **Low Priority** - Migrate to ON CONFLICT for performance

## Files Modified

- ✅ backend-saas/core/graphrag_engine.py (lines 1155-1185)

- ⏳ Migration to add unique constraints (TODO)

- ⏳ ON CONFLICT implementation (TODO)

## Deployment

Commit message ready:

```bash
git add backend-saas/core/graphrag_engine.py
git commit -m "fix: implement GraphRAG entity deduplication to prevent duplicate nodes

CRITICAL BUG FIX: Prevents duplicate GraphNode creation on repeated ingestion.

Changes:
- Check for existing nodes by (workspace_id, name, type) before inserting
- Update existing nodes with new description/properties instead of creating duplicates
- Add debug logging for tracking creates vs updates

Root Cause: Previous implementation blindly created new nodes on every ingestion,
causing database bloat and performance degradation with duplicate entities.

Test: Ingest same email twice - should see 1 GraphNode, not 2.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"
```

Run tests before deploying to verify.